DSCI 542 Lab 1

General lab instructions

rubric={mechanics:4}

Follow the general lab instructions.

Link to GitHub Repo

In-class activities

rubric={raw:10}

1st activity worksheet: Activity 1 Worksheet

2nd activity worksheet: Activity 2 Markup Doc

Pre-lab

Before the lab, read David Robinson's blog post about Donald Trump's Twitter account during the 2016 campaign race at least once. The entire lab is based around this post.

Exercise 1: Reactions

1(a)

rubric={reasoning:3}

This blog post is based around a hypothesis. What is the hypothesis?

1a: The article's hypothesis is that the tweets sent from Donald Trump's account on Android and on iPhone are written by different people, and that data and text analysis of the tweet patterns and content can show this.

1(b)

rubric={reasoning:8}

Report on your reactions to the article. In particular:

Note: A couple of sentences is sufficient for each of the three points above. No need to write a lengthy response.

1b:

1(c)

rubric={reasoning:6}

Based on your reactions to David Robinson's post, what is one thing that you think is worth emulating when communicating the results of data analysis? What is one thing worth avoiding?

Recommended answer length: 100 words. Max answer length: 200 words.

1c:
I really liked David Robinson's general brevity and clear messaging when announcing the conclusions of each section of analysis, and I want to emulate this in the future. I appreciated that he didn't spend an inordinate amount of time on pontification and purple prose about his results. For a blog post, I think this is important because the audience wants the pacing to be quick and entertaining, not an extended deep dive into the content. However, as mentioned earlier in 1a, I think a quick summary before introducing the analysis of some of the more esoteric mechanics of Twitter might have been beneficial. Even as a heavy Twitter user, I found the reasoning and conclusions behind his tweet wrangling a little confusing. I think it's important to avoid assuming advanced knowledge on the part of your audience.

Exercise 2: audience

2(a)

rubric={reasoning:6}

Who do you think is the intended audience for David Robinson's post? Describe the target reader in as much detail as you can. Is it targeted at people of a specific age group? Nationality? Political inclination? Make sure your answer includes the assumed level of data science skills for the target reader.

Note: In part (b) you'll justify your claims.

2a: I believe the main target audience of David Robinson's post is data science practitioners of moderate skill level who would be interested in learning more about sentiment analysis. The article is a subtle advertisement and demonstration for tidytext, the new R package that David and his collaborator Julia Silge have developed. I would argue the assumed skill level is moderate because beginner data scientists might not be comfortable with the assumed scraping knowledge, while advanced data scientists would already have their own sentiment analysis workflows. Because the subject is Donald Trump, we can infer that the audience would need to be quite familiar with Trump, and is therefore more likely American, or at least North American. The assumed knowledge of the Twitter platform and the political theme indicate that the target age demographic would mostly match the intersection of these groups, which I would estimate to be young to middle-aged adults. Finally, while the author tries to stay politically neutral for much of the article, the negative connotations in the descriptions of Donald Trump hint that the target audience leans left politically.

2(b)

rubric={reasoning:6}

How do you know what the audience is? For your claims in part (a), give specific examples from the original text that justify your claims. 1-2 sentences per claim is sufficient.

2b

Exercise 3: changing the audience

rubric={reasoning:8,writing:8}

David Robinson's blog post definitely assumes some level of data science knowledge. However, non-data scientists might be interested in his findings as well. Your task is to rewrite a much shorter version of the blog post, this time targeting it toward a reader without data science knowledge. You can assume the reader has the basic knowledge required to understand the argument - they know what an iPhone is, what Android is; they know who Donald Trump is and some basic facts about him; they know what Twitter is and what a tweet is and that @realDonaldTrump is Trump's (former) Twitter handle. However, they have no background in programming, statistics, or data science.

Recommended length: 300 words. Maximum length: 500 words. The original blog post is around 2000 words, so keep in mind that your version needs to be much shorter than the original.

Note: don't refer to David Robinson or David Robinson's post in your post. For the purposes of this exercise, you are pretending to be David Robinson writing a separate post for a different audience. You are welcome to use the first person (e.g. "I analyzed the data") but you don't have to.

Note: while this would normally be considered plagiarism, for this exercise you are welcome to copy from David Robinson's post. In general this is probably a bad idea, since the new audience will probably require a complete rewrite, but we will leave this option open to you for both David Robinson's text and visualizations. If you include visualizations, make sure they render properly in your HTML on Canvas so that the grader can see them. However, again, think carefully about whether including any original material actually makes sense for your new audience.

Does Trump Write His Own Tweets?

I recently discovered a tweet that asks an interesting question: Are Trump's Android and iPhone tweets written by different people?

The premise is that differences in tweeting patterns and tweet content between the two devices point to separate authors, as others have noted anecdotally. People have observed that the Android tweets mimic Trump's verbal speech patterns, are more negative, and use fewer Twitter features, and that Trump himself tweets from an Android phone.

Conversely, his iPhone tweets tend to cover state events and public relations content, and they contain more hashtags and linked images.

I was excited to perform a data analysis that quantitatively measures the differences between the Android and iPhone tweets, to help determine whether they in fact come from different people!

Tweet Time

The first metric I investigated was whether Android and iPhone tweets were sent at different times of the day. From the 762 Android tweets and 628 iPhone tweets available, we see the following time breakdown by platform.

We can see a clear time difference between when the Trump account posts with Android versus the iPhone. The iPhone posts occur during the mid-morning and early evening timeframes, which would hint at work hours. In contrast, the Android platform tweets occur frequently in the early morning and late at night, which coincides with recreational use hours.

Use of Twitter Features

Another key observation is that the iPhone and Android tweets use Twitter features such as retweets, hashtags, and picture links very differently. One particularly distinctive habit is the Trump account's old-fashioned practice of “manually retweeting” people by copy-pasting their tweets, then surrounding them with quotation marks.

We can see in the plot below that almost all of these 'manual retweets' happen on Android, and that they account for about a third of his Android tweets!

Another difference in Twitter feature usage between Android and iPhone is the use of pictures or links, demonstrated below:

Here we see that tweets from the iPhone were 38 times as likely to contain either a picture or a link!

Tweet Word Content

Finally, by looking at the words used on each platform, we can measure differences in sentiment, which describes the emotion or connotation a word carries. The comparison of word sentiment by platform below shows how the text of the Android and iPhone tweets differs.

Here, we can see that words with negative sentiments (sadness, disgust) occur much more commonly than positive sentiments (joy, trust) in Android tweets!

Conclusion

My analysis concludes that the Android and iPhone tweets are clearly from different people, posting during different times of day and using hashtags, links, and retweets in distinct ways. What’s more, we can see that the Android tweets are angrier and more negative, while the iPhone tweets tend to be benign announcements and pictures.

Submission to Canvas

When you are ready to submit your assignment do the following:

  1. Run all cells in your notebook to make sure there are no errors by doing Kernel -> Restart Kernel and Run All Cells...
  2. Save your notebook.
  3. Convert your notebook to .html format using the convert_notebook() function below or by File -> Export Notebook As... -> Export Notebook to HTML
  4. Run the code submit() below to go through an interactive submission process to Canvas.
  5. Finally, push all your work to GitHub (including the rendered html file).